Abstract

Online learning algorithms have impressive convergence properties when it comes to risk
minimization and convex games on very large problems. However, they are inherently
sequential in their design, which prevents them from taking advantage of modern multi-core
architectures. In this paper we prove that online learning with delayed updates converges
well, thereby facilitating parallel online learning.

1. Introduction 

Online learning has become the paradigm of choice for tackling very large scale estimation
problems. The convergence properties of online algorithms are well understood and have been
analyzed in a number of different frameworks such as by means of asymptotics (Murata et al.,
1994), game theory (Hazan et al., 2007), or stochastic programming (Nesterov and Vial,
2000). Moreover, learning-theory guarantees show that $O(1)$ passes over a dataset suffice to
obtain optimal estimates (Bottou and LeCun, 2004; Bottou and Bousquet, 2007). All those
properties combined suggest that online algorithms are an excellent tool for addressing
learning problems.

This view, however, is slightly deceptive for several reasons: current online algorithms 
process one instance at a time. That is, they receive the instance, make some prediction, 
incur a loss, and update an associated parameter. In other words, the algorithms are 
entirely sequential in their nature. While this is acceptable in single-core processors, it is 
highly undesirable given that the number of processing elements available to an algorithm 
is growing exponentially (e.g. modern desktop machines have up to 8 cores, graphics cards 
up to 1024 cores). It is therefore very wasteful if only one of these cores is actually used for 
estimation. 

A second problem arises from the fact that network and disk I/O have not been able to 
keep up with the increase in processor speed. A typical network interface has a throughput of 
100MB/s and disk arrays have comparable parameters. This means that current algorithms
reach their limit at problems of size 1TB whenever the algorithm is I/O bound (this amounts 
to a training time of 3 hours), or even smaller problems whenever the model parametrization 
makes the algorithm CPU bound. 
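
The training time quoted above follows from simple arithmetic. As a small check, assuming decimal units (1TB = 10^12 bytes, 100MB/s = 10^8 bytes per second); the helper name `io_bound_hours` is ours and purely illustrative:

```python
def io_bound_hours(dataset_bytes, throughput_bytes_per_s):
    """Time in hours needed to stream a dataset once at a fixed throughput."""
    return dataset_bytes / throughput_bytes_per_s / 3600.0

# Assumed decimal units: 1TB = 1e12 bytes, 100MB/s = 1e8 bytes per second.
hours = io_bound_hours(1e12, 1e8)
print(f"{hours:.2f} hours")
```

A single pass therefore takes a little under three hours, which matches the rounded figure in the text.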

Finally, distributed and cloud computing are unsuitable for today's online learning
algorithms. This creates a pressing need to design algorithms which break the sequential
bottleneck. We propose two variants. To our knowledge, this is the first paper which
provides theoretical guarantees combined with empirical evidence for such an algorithm.
Previous work, e.g. by Delalleau and Bengio (2007), proved rather inconclusive in terms of
theoretical and empirical guarantees.

In a nutshell, we propose the following two variants: several processing cores perform 
stochastic gradient descent independently of each other while sharing a common parameter 
vector which is updated asynchronously. This allows us to accelerate computationally inten- 
sive problems whenever gradient computations are relatively expensive. A second variant 
assumes that we have linear function classes where parts of the function can be computed 
independently on several cores. Subsequently the results are combined and the combination 
is then used for a descent step. 

A common feature of both algorithms is that the update occurs with some delay: in 
the first case other cores may have updated the parameter vector in the meantime, in the 
second case, other cores may have already computed parts of the function for the subsequent 
examples before an update. 



2. Algorithm 



2.1 Platforms 

We begin with an overview of three platforms which are available for parallelization of
algorithms. They differ in their structural parameters, such as synchronization ability, latency,
and bandwidth, and consequently they are better suited to different styles of algorithms.
This description is not comprehensive by any means. For instance, there exist numerous
variants of communication paradigms for distributed and cloud computing, ranging
from fully independent Folding@Home algorithms (Shirts and Pande, 2000) to sophisticated
pipelines like the Dryad architecture (Isard et al., 2007).



Shared Memory Architectures: The commercially available 4-16 core CPUs on servers 
and desktop computers fall into this category. They are general purpose processors 
which operate on a joint memory space where each of the processors can execute 
arbitrary pieces of code independently of other processors. Synchronization is easy via 
shared memory/interrupts/locks. The critical shared resource is memory bandwidth. 
This problem can be somewhat alleviated by exploiting affinity of processes to specific 
cores. 

A second example of a shared memory architecture are graphics cards. There the 
number of processing elements is vastly higher (512 on high-end consumer graphics 
cards), although they tend to be bundled into groups of 8 cores (also referred to as 
multiprocessing elements), each of which can execute a given piece of code in a data- 
parallel fashion. An issue is that explicit synchronization between multiprocessing 
elements is difficult — it requires computing kernels on the processing elements to 
complete. This means that an explicit synchronization mechanism may be undesirable 
since it comes at the expense of a large performance penalty or a significant increase in 
latency. Implicit synchronization via shared memory is still possible. Critical resources 
are availability of memory: consumer grade graphics cards have in the order of 512MB
high speed RAM per chip. Communication between multiple chips is nontrivial.

Clusters: To increase I/O bandwidth one can combine several computers in a cluster
using MPI or PVM as the underlying communications mechanism. A clear limit here
is bandwidth constraints and latency for inter-computer communication. On Gigabit
Ethernet the TCP/IP latency can be in the order of 100μs, the equivalent of $10^5$ clock
cycles on a processor, and network bandwidth tends to be a factor 100 slower than
memory bandwidth. Infiniband is approximately one order of magnitude faster but it
is rarely found in off-the-shelf server farms.



Grid Computing: Computational paradigms such as MapReduce (Chu et al., 2007) are
well suited for the parallelization of batch-style algorithms (Teo et al., 2009). In
comparison to cluster configurations communication and latency are further constrained.
For instance, often individual processing elements are unable to communicate directly 
with other elements with disk / network storage being the only mechanism of inter- 
process data transfer. Moreover, the latency is significantly increased, typically in the 
order of seconds, due to the interleaving of Map and Reduce processing stages. 

Of the above three platform types we will only consider the first two since latency plays a 
critical role in the analysis of the class of algorithms we propose. While we do not exclude 
the possibility of devising parallel online algorithms suited to grid computing, we believe 
that the family of algorithms proposed in this paper is unsuitable and a significantly different
synchronization paradigm would need to be explored. 



2.2 Delayed Stochastic Gradient Descent 

Many learning problems can be written as convex minimization problems. It is our goal to 
find some parameter vector $x$ (which is drawn from some Banach space $X$ with associated
norm $\|\cdot\|$) such that the sum over convex functions $f_i : X \to \mathbb{R}$ takes on the smallest
value possible. For instance, (penalized) maximum likelihood estimation in exponential 
families with fully observed data falls into this category, so do Support Vector Machines 
and their structured variants. This also applies to distributed games with a communications 
constraint within a team. 

At the outset we make no special assumptions on the order or form of the functions
$f_i$. In particular, an adversary may choose to order or generate them in response to our
previous choices of $x$. In other cases, the functions $f_i$ may be drawn from some distribution
(e.g. whenever we deal with induced losses). It is our goal to find a sequence of $x_i$ such that
the cumulative loss $\sum_i f_i(x_i)$ is minimized. With some abuse of notation we identify the
average empirical and expected loss both by $f^*$. This is possible, simply by redefining $p(f)$
to be the uniform distribution over $F$. Denote by

$$f^*(x) := \frac{1}{|F|} \sum_i f_i(x) \quad\text{or}\quad f^*(x) := \mathbf{E}_{f \sim p(f)}[f(x)] \tag{1}$$

$$\text{and correspondingly}\quad x^* := \operatorname{argmin}_{x \in X} f^*(x) \tag{2}$$

the average risk. We assume that $x^*$ exists (convexity does not guarantee a bounded
minimizer) and that it satisfies $\|x^*\| \le R$ (this is always achievable, simply by intersecting
$X$ with the ball of radius $R$). We propose the following algorithm:

Algorithm 1 Delayed Stochastic Gradient Descent

Input: Feasible space $X \subseteq \mathbb{R}^n$, annealing schedule $\eta_t$ and delay $\tau \in \mathbb{N}$
Initialization: set $x_1, \ldots, x_\tau = 0$ and compute corresponding $g_t = \nabla f_t(x_t)$.
for $t = \tau + 1$ to $T + \tau$ do
    Obtain $f_t$ and incur loss $f_t(x_t)$
    Compute $g_t := \nabla f_t(x_t)$
    Update $x_{t+1} = \operatorname{argmin}_{x \in X} \|x - (x_t - \eta_t g_{t-\tau})\|$ (Gradient Step and Projection)
end for



In this paper the annealing schedule will be either $\eta_t = \frac{1}{\sigma(t-\tau)}$ or $\eta_t = \frac{\sigma}{\sqrt{t-\tau}}$. Often,
$X = \mathbb{R}^n$. If we set $\tau = 0$, Algorithm 1 becomes an entirely standard stochastic gradient
descent algorithm. The only difference with delayed stochastic gradient descent is that
we do not update the parameter vector $x_t$ with the current gradient $g_t$ but rather with
a delayed gradient $g_{t-\tau}$ that we computed $\tau$ steps previously. Later we will extend this
simple stochastic gradient descent model in two ways: firstly we will extend the updates to
implicit updates as they arise from the use of Bregman divergences (see Section 5), leading
to variants such as parallel exponentiated gradient descent. Secondly, we will modify bounds
which are dependent on strong convexity (Bartlett et al., 2008; Do et al., 2009) to obtain
adaptive algorithms which can take advantage of well-behaved optimization problems in
practice.
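
To make the update rule of Algorithm 1 concrete, the following is a minimal sketch in Python with NumPy, not a reference implementation. It assumes the schedule $\eta_t = \sigma/\sqrt{t-\tau}$, and, as a further assumption of ours, takes the feasible set to be either all of $\mathbb{R}^n$ or a Euclidean ball (the optional `radius` argument). The function name `delayed_sgd` and the quadratic test losses are hypothetical.

```python
import numpy as np

def delayed_sgd(grad_fns, dim, tau, sigma, radius=None):
    """Sketch of Algorithm 1 with eta_t = sigma / sqrt(t - tau).

    grad_fns[t - 1] returns the gradient of f_t; len(grad_fns) plays the role of T + tau.
    radius is an assumption: if given, X is taken to be the Euclidean ball of that radius.
    """
    x = np.zeros(dim)
    grads = []                                  # grads[t - 1] stores g_t
    for t in range(1, len(grad_fns) + 1):
        grads.append(grad_fns[t - 1](x))        # g_t := grad f_t(x_t)
        if t <= tau:
            continue                            # initialization: x_1 = ... = x_tau = 0
        eta = sigma / np.sqrt(t - tau)
        x = x - eta * grads[t - tau - 1]        # gradient step with the delayed g_{t - tau}
        if radius is not None:
            norm = np.linalg.norm(x)
            if norm > radius:
                x = x * (radius / norm)         # projection onto the ball
    return x

# Hypothetical test problem: every f_t(x) = 0.5 * ||x - c||^2, so grad f_t(x) = x - c.
c = np.array([1.0, -2.0])
grad_fns = [lambda x: x - c] * 2000
x_plain = delayed_sgd(grad_fns, dim=2, tau=0, sigma=0.2)
x_delayed = delayed_sgd(grad_fns, dim=2, tau=3, sigma=0.2)
```

With `tau=0` the loop reduces to ordinary stochastic gradient descent; with `tau=3` each step applies a gradient that is three iterations old, and on this easy problem both runs end up close to `c`.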



2.3 Templates 

Asynchronous Optimization Assume that we have $n$ processors which can process data
independently of each other, e.g. in a multicore platform, a graphics card, or a cluster of
workstations. Moreover, assume that computing the gradient of $f_t(x)$ is at least $n$ times
as expensive¹ as it is to update $x$ (read, add, write). This occurs, for instance, in the case
of conditional random fields (Ratliff et al., 2007; Vishwanathan et al., 2006), in planning
(Ratliff et al., 2006), and in ranking (Weimer et al., 2008).

The rationale for delayed updates can be seen in the following setting: assume that we
have $n$ cores performing stochastic gradient descent on different instances $f_t$ while sharing
one common parameter vector $x$. If we allow each core in a round-robin fashion to update
$x$ one at a time then there will be a delay of $\tau = n - 1$ between when we see $f_t$ and
when we get to update $x_{t+\tau}$. The delay arises since updates by different cores cannot
happen simultaneously. This setting is preferable whenever computation of $f_t$ itself is time
consuming.

Note that there is no need for explicit thread-level synchronization between individual
cores. All we need is a read / write-locking mechanism for $x$ or alternatively, atomic updates
on the parameter vector.² This is important since thread synchronization on GPUs is
rudimentary at best. Keeping the state synchronized by a shared memory architecture is
key.

Figure 1: Data parallel stochastic gradient descent with shared parameter vector.
Observations are partitioned on a per-instance basis among $n$ processing units. Each
of them computes its own loss gradient $g_t = \partial_x f_t(x_t)$. Since each computer is
updating $x$ in a round-robin fashion, it takes a delay of $\tau = n - 1$ between gradient
computation and when the gradients are applied to $x$.

1. More fine-grained variants are possible where we write only parts of the parameter vector at a time,
thereby requiring locks on only parts of $x$ by an updating processor. We omit details of such modifications
as they are entirely technical and do not add to the key idea of the paper.

2. There exists some limited support for this in the Intel Threading Building Blocks library for the x86
architecture.

On a multi-computer cluster we can use a similar mechanism simply by having one 
server act as a state-keeper which retains an up-to-date copy of x while the loss-gradient 
computation clients can retrieve at any time a copy of x and send gradient update messages 
to the state keeper. Note that this is only feasible whenever the message size does not
exceed $\frac{1}{n}$ of the bandwidth of the state-keeper. This suggests an alternative variant of the
algorithm which is considerably less demanding in terms of bandwidth constraints. 
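
The claim that round-robin updates by $n$ workers lead to a delay of $n - 1$ can be illustrated with a toy bookkeeping simulation. This is only a sketch under an idealized assumption of ours: each worker re-reads the shared vector immediately after applying its own update and then waits for its next turn. The function name `round_robin_staleness` is hypothetical.

```python
def round_robin_staleness(n_workers, n_rounds):
    """Return, for every update, how many updates old the parameter copy was
    on which the applied gradient had been computed."""
    version = 0                      # number of updates applied to the shared x so far
    read_at = [0] * n_workers        # version of x each worker last read
    staleness = []
    for step in range(n_workers * n_rounds):
        k = step % n_workers         # workers take turns in a fixed order
        staleness.append(version - read_at[k])
        version += 1                 # worker k applies its gradient ...
        read_at[k] = version         # ... and immediately re-reads x
    return staleness

stale = round_robin_staleness(n_workers=4, n_rounds=3)
print(stale)
```

After the first warm-up round every applied gradient is exactly `n_workers - 1` updates old.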

Pipelined Optimization The key impediment in the previous template is that it
required significant amounts of bandwidth solely for the purpose of synchronizing the state
vector. This can be addressed by parallelizing computing the function value $f_i(x)$ explicitly
rather than attempting to compute several instances of $f_i(x)$ simultaneously. Such
situations occur, e.g. when $f_i(x) = g(\langle \phi(z_i), x \rangle)$ for high-dimensional $\phi(z_i)$. If we decompose
the data $z_i$ (or its features) over $n$ nodes we can compute partial function values and also
all partial updates locally. The only communication that is required is to combine partial
values and to compute the gradient with respect to $\langle \phi(z_i), x \rangle$.

This causes delay since the second stage is processing results of the first stage while the
latter has already moved on to processing $f_{t+1}$ or further. While the architecture is quite
different, the effects are identical: the parameter vector $x$ is updated with some delay $\tau$.
Note that here $\tau$ can be much smaller than the number of processors and mainly depends on
the latency of the communication channel. Also note that in this configuration the memory
access for $x$ is entirely local.
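
The feature-wise decomposition described above can be sketched as follows. The code runs the "nodes" sequentially in one process, so it only illustrates which quantities are local and which are communicated; it is not a pipelined implementation. The names `partitioned_gradient` and `dloss` (the derivative of the scalar function $g$) are ours.

```python
import numpy as np

def partitioned_gradient(phi, x, n_nodes, dloss):
    """Gradient of g(<phi, x>) with the coordinates split across n_nodes blocks."""
    blocks = np.array_split(np.arange(phi.size), n_nodes)
    # Stage 1: every node computes a partial inner product on its own coordinates.
    partial = [phi[b] @ x[b] for b in blocks]
    # Only these scalars are communicated and combined.
    inner = sum(partial)
    # Stage 2: derivative of the scalar function at the combined inner product.
    slope = dloss(inner)
    grad = np.empty_like(x)
    for b in blocks:
        grad[b] = slope * phi[b]     # each node forms its part of the gradient locally
    return grad

# Hypothetical example: squared loss g(s) = 0.5 * (s - 1)^2, so g'(s) = s - 1.
rng = np.random.default_rng(0)
phi = rng.normal(size=10)
x = rng.normal(size=10)
dloss = lambda s: s - 1.0
grad_split = partitioned_gradient(phi, x, n_nodes=3, dloss=dloss)
grad_direct = dloss(phi @ x) * phi
```

The split computation agrees with the direct one, while each block only ever touches its own slice of `x`.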

Randomization Order of observations matters for delayed updates: imagine that an
adversary, aware of the delay $\tau$, bundles each of the $\tau$ most similar instances $f_t$ together. In
this case we will incur a loss that can be $\tau$ times as large as in the non-delayed case and
require a learning rate which is $\tau$ times smaller. The reason is that only after seeing
$\tau$ instances of $f_t$ will we be able to respond to the data. Such highly correlated settings
do occur in practice: for instance, e-mails or search keywords have significant temporal
correlation (holidays, political events, time of day) and cannot be treated as iid data.

A simple strategy can be used to alleviate this problem: decorrelate observations by
random permutations of the instances. The price we pay for this modification is a delay in
updating the model parameters (there need not be any delay in the prediction itself) since
obviously the range of decorrelation needs to exceed $\tau$ considerably.
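
One simple way to realize such a decorrelation on a stream is to permute instances within consecutive windows. The sketch below is an illustration under our own assumptions rather than the procedure used by the authors: the window length `window` is a free parameter that would be chosen much larger than the delay, and the name `block_shuffle` is hypothetical.

```python
import random

def block_shuffle(instances, window, seed=0):
    """Randomly permute instances inside consecutive windows of the given length."""
    rng = random.Random(seed)
    out = []
    for start in range(0, len(instances), window):
        block = list(instances[start:start + window])
        rng.shuffle(block)           # in-place permutation of one window
        out.extend(block)
    return out

stream = list(range(20))
shuffled = block_shuffle(stream, window=5)
```

Every instance stays inside its window, so model updates are postponed by at most one window length, which is the price mentioned in the text.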

3. Lipschitz Continuous Losses 

We begin with a simple game theoretic analysis that only requires $f_t$ to be convex and where
the subdifferentials are bounded $\|\nabla f_t(x)\| \le L$ by some $L > 0$. Denote by $x^*$ the minimizer
of $f^*(x)$. It is our goal to bound the regret $R$ associated with a sequence $X = \{x_1, \ldots, x_T\}$
of parameters

$$R[X] := \sum_{t=1}^{T} f_t(x_t) - f_t(x^*). \tag{3}$$



Such bounds can then be converted into bounds on the expected loss. See e.g. (Shalev-Shwartz et al.,
2007) for an example of a randomized conversion. Since all $f_t$ are convex we can upper bound
$R[X]$ via

$$R[X] \le \sum_{t=1}^{T} \langle \nabla f_t(x_t), x_t - x^* \rangle = \sum_{t=1}^{T} \langle g_t, x_t - x^* \rangle. \tag{4}$$

Next define a potential function measuring the distance between $x_t$ and $x^*$. In the more
general analysis this will become a Bregman divergence. We define $D(x\|x') := \frac{1}{2}\|x - x'\|^2$. To
prove regret bounds we need the following auxiliary lemma which bounds the instantaneous
risk at a given time:

Lemma 1 For all $x^*$ and for all $t > \tau$, if $X = \mathbb{R}^n$, the following expansion holds:

$$\langle x_{t-\tau} - x^*, g_{t-\tau} \rangle = \frac{1}{2}\eta_t \|g_{t-\tau}\|^2 + \frac{D(x^*\|x_t) - D(x^*\|x_{t+1})}{\eta_t} + \sum_{j=1}^{\min(\tau, t-(\tau+1))} \eta_{t-j} \langle g_{t-\tau-j}, g_{t-\tau} \rangle \tag{5}$$

Furthermore, (5) holds as an upper bound if $X \subset \mathbb{R}^n$.

Proof The divergence function allows us to decompose our progress via

$$\begin{aligned}
D(x^*\|x_{t+1}) - D(x^*\|x_t) &= \tfrac{1}{2}\|x^* - x_t + x_t - x_{t+1}\|^2 - \tfrac{1}{2}\|x^* - x_t\|^2 && (6)\\
&= \tfrac{1}{2}\|x^* - x_t + \eta_t g_{t-\tau}\|^2 - \tfrac{1}{2}\|x^* - x_t\|^2 && (7)\\
&= \tfrac{1}{2}\eta_t^2\|g_{t-\tau}\|^2 - \eta_t\langle x_t - x^*, g_{t-\tau}\rangle && (8)\\
&= \tfrac{1}{2}\eta_t^2\|g_{t-\tau}\|^2 - \eta_t\langle x_{t-\tau} - x^*, g_{t-\tau}\rangle - \eta_t\langle x_t - x_{t-\tau}, g_{t-\tau}\rangle && (9)
\end{aligned}$$

We can now expand the inner product between delayed parameters $\langle x_t - x_{t-\tau}, g_{t-\tau}\rangle$ in
terms of differences between gradients. Here we need to distinguish the initialization: for
$\tau < t \le 2\tau$ we only obtain differences between $t - \tau$ gradients, since the optimization
protocol initializes $x_t = x_1$ for all $t \le \tau$. This yields

$$\langle x_t - x_{t-\tau}, g_{t-\tau}\rangle = \sum_{j=1}^{\min(\tau, t-(\tau+1))} \langle x_{t-(j-1)} - x_{t-j}, g_{t-\tau}\rangle = -\sum_{j=1}^{\min(\tau, t-(\tau+1))} \eta_{t-j}\langle g_{t-\tau-j}, g_{t-\tau}\rangle \tag{10}$$

Plugging the above into (9), dividing both sides by $\eta_t$ and moving $\langle x_{t-\tau} - x^*, g_{t-\tau}\rangle$ to the
LHS completes the proof.

To show that the inequality holds note that distances between vectors can only decrease
if we project onto convex sets. The argument follows that of Zinkevich (2003). ■
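
As a sanity check of Lemma 1 in the unconstrained case, the expansion can be verified numerically. The sketch below runs the delayed update with $\eta_t = \sigma/\sqrt{t-\tau}$ on hypothetical quadratic losses $f_t(x) = \frac{1}{2}\|x - c_t\|^2$ (our choice, as are all variable names) and compares $\langle x_{t-\tau} - x^*, g_{t-\tau}\rangle$ with the sum of the three terms on the right-hand side for every $t > \tau$. Since the expansion holds for any comparison point, `x_star` is simply drawn at random.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, tau, T, sigma = 3, 2, 20, 0.3
x_star = rng.normal(size=dim)                     # arbitrary comparison point
centers = rng.normal(size=(T + tau + 1, dim))     # c_t defining f_t(x) = 0.5 * ||x - c_t||^2

def eta(t):
    return sigma / np.sqrt(t - tau)

def D(a, b):
    return 0.5 * np.sum((a - b) ** 2)

# 1-based bookkeeping: x[t] and g[t] mirror x_t and g_t; x_1, ..., x_{tau+1} are zero.
x = {t: np.zeros(dim) for t in range(1, tau + 2)}
g = {t: x[t] - centers[t] for t in range(1, tau + 1)}
for t in range(tau + 1, T + tau + 1):
    g[t] = x[t] - centers[t]
    x[t + 1] = x[t] - eta(t) * g[t - tau]         # delayed gradient step, no projection

def lhs(t):
    return np.dot(x[t - tau] - x_star, g[t - tau])

def rhs(t):
    corr = sum(eta(t - j) * np.dot(g[t - tau - j], g[t - tau])
               for j in range(1, min(tau, t - (tau + 1)) + 1))
    return (0.5 * eta(t) * np.dot(g[t - tau], g[t - tau])
            + (D(x_star, x[t]) - D(x_star, x[t + 1])) / eta(t)
            + corr)

max_gap = max(abs(lhs(t) - rhs(t)) for t in range(tau + 1, T + tau + 1))
```

Up to floating point error the two sides coincide, including for the early steps $\tau < t \le 2\tau$ where fewer correlation terms appear.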

Note that the decomposition (5) is very similar to standard regret decomposition bounds,
such as (Zinkevich, 2003). The key difference is that we now have an additional term
characterizing the correlation between successive gradients which needs to be bounded. In
the worst case all we can do is bound $\langle g_{t-\tau-j}, g_{t-\tau}\rangle \le L^2$, whenever the gradients are highly
correlated, which leads to the following theorem:

Theorem 2 Suppose all the cost functions are Lipschitz continuous with a constant $L$ and
$\max_{x,x' \in X} D(x\|x') \le F^2$. Given $\eta_t = \frac{\sigma}{\sqrt{t-\tau}}$ for some constant $\sigma > 0$, the regret of the
delayed update algorithm is bounded by

$$R[X] \le \sigma L^2\sqrt{T} + F^2\frac{\sqrt{T}}{\sigma} + L^2\frac{\sigma\tau^2}{2} + 2L^2\sigma\tau\sqrt{T} \tag{11}$$

and consequently for $\sigma^2 = \frac{F^2}{2\tau L^2}$ and $T \ge \tau^2$ we obtain the bound

$$R[X] \le 4FL\sqrt{\tau T} \tag{12}$$

Proof Before we prove the claim, we briefly state a few useful identities concerning sums.

$$\sum_{i=1}^{n} i = \frac{n(n+1)}{2} \tag{13}$$

$$\sum_{i=a}^{b} \frac{1}{2\sqrt{i}} \le \int_{a-1}^{b} \frac{1}{2\sqrt{x}}\,dx = \sqrt{b} - \sqrt{a-1} \le \sqrt{b-a+1} \tag{14}$$



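Summing over (5) and using Lemma 1 yields the inequality

$$\begin{aligned}
\sum_{t=\tau+1}^{T+\tau} \langle x_{t-\tau} - x^*, g_{t-\tau}\rangle
&\le \sum_{t=\tau+1}^{T+\tau} \left[\frac{1}{2}\eta_t\|g_{t-\tau}\|^2 + \frac{D(x^*\|x_t) - D(x^*\|x_{t+1})}{\eta_t} + \sum_{j=1}^{\min(\tau,t-(\tau+1))} \eta_{t-j}\langle g_{t-\tau-j}, g_{t-\tau}\rangle\right] \\
&= \sum_{t=\tau+1}^{T+\tau} \left[\frac{1}{2}\eta_t\|g_{t-\tau}\|^2 + \sum_{j=1}^{\min(\tau,t-(\tau+1))} \eta_{t-j}\langle g_{t-\tau-j}, g_{t-\tau}\rangle\right] \\
&\quad + \frac{D(x^*\|x_{\tau+1})}{\eta_{\tau+1}} - \frac{D(x^*\|x_{T+\tau+1})}{\eta_{T+\tau}} + \sum_{t=\tau+2}^{T+\tau} D(x^*\|x_t)\left[\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\right] && (15)
\end{aligned}$$

By the Lipschitz property of gradients and the definition of $\eta_t$ we can bound the first
summand of the above risk inequality via

$$\sum_{t=\tau+1}^{T+\tau} \frac{1}{2}\eta_t\|g_{t-\tau}\|^2 \le \sum_{t=\tau+1}^{T+\tau} \frac{1}{2}\eta_t L^2 = \sum_{t=1}^{T} \frac{\sigma}{2\sqrt{t}} L^2 \le \sigma L^2\sqrt{T}. \tag{16}$$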

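Next we tackle the terms dependent on $D$. By the assumption on the diameter, $D(x^*\|x_t) \le F^2$
for all $x_t$. This yields

$$\frac{D(x^*\|x_{\tau+1})}{\eta_{\tau+1}} + \sum_{t=\tau+2}^{T+\tau} D(x^*\|x_t)\left[\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\right] \le \frac{F^2}{\sigma} + \frac{F^2}{\sigma}\sum_{t=\tau+2}^{T+\tau}\left[\sqrt{t-\tau} - \sqrt{t-\tau-1}\right] = \frac{F^2}{\sigma}\left[1 + \sqrt{T} - 1\right] = \frac{F^2}{\sigma}\sqrt{T}. \tag{17}$$

Here the second to last equality follows from the fact that we have a telescoping sum. Note
that we can discard the contribution of $-\frac{D(x^*\|x_{T+\tau+1})}{\eta_{T+\tau}}$ since it is never positive; keeping it
could only make the bound tighter.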

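Finally, we address the contribution of the inner products between gradients. By the
Lipschitz property of the gradients we know that $\langle g_{t-\tau-j}, g_{t-\tau}\rangle \le L^2$. Moreover, $\eta_t$ is
monotonically decreasing, hence we can bound the correlation term in Lemma 1 via

$$\sum_{j=1}^{\min(\tau,t-(\tau+1))} \eta_{t-j}\langle g_{t-\tau-j}, g_{t-\tau}\rangle \le \min(t-(\tau+1),\tau)\,\eta_{\max(\tau+1,t-\tau)} L^2. \tag{18}$$

Summing over all contributions yields

$$\begin{aligned}
\sum_{t=\tau+1}^{T+\tau} \min(t-(\tau+1),\tau)\,\eta_{\max(\tau+1,t-\tau)}
&= \sum_{t=\tau+1}^{2\tau} (t-(\tau+1))\,\eta_{\tau+1} + \sum_{t=2\tau+1}^{T+\tau} \tau\,\eta_{t-\tau} \\
&= \sum_{t=\tau+1}^{2\tau} \sigma(t-(\tau+1)) + \sum_{t=2\tau+1}^{T+\tau} \frac{\sigma\tau}{\sqrt{t-2\tau}} \\
&\le \frac{\sigma\tau^2}{2} + 2\sigma\tau\sqrt{T-\tau} \le \frac{\sigma\tau^2}{2} + 2\sigma\tau\sqrt{T}.
\end{aligned}$$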

Substituting the bounds for all three terms into the gradient bound yields

$$R[X] \le \sum_{t=\tau+1}^{T+\tau} \langle x_{t-\tau} - x^*, g_{t-\tau}\rangle \le \sigma L^2\sqrt{T} + F^2\frac{\sqrt{T}}{\sigma} + L^2\frac{\sigma\tau^2}{2} + 2L^2\sigma\tau\sqrt{T}. \tag{19}$$

Plugging in $\sigma = \frac{F}{L\sqrt{2\tau}}$ changes the RHS to

$$R[X] \le \frac{FL\sqrt{T}}{\sqrt{2\tau}} + FL\sqrt{2\tau T} + FL(\tau/2)^{3/2} + FL\sqrt{2\tau T} \le FL\sqrt{2\tau T}\left[2 + \frac{1}{2\tau} + \frac{\tau}{4\sqrt{T}}\right].$$

Using the fact that $\tau \ge 1$ (otherwise our analysis is vacuous) and $T \ge \tau^2$ (it is reasonable
to assume that we have at least $O(\tau)$ data per processor) yields the claim. ■
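
The last step of the proof is easy to check numerically. The sketch below evaluates the right-hand side of (11) at $\sigma = F/(L\sqrt{2\tau})$ and compares it with $4FL\sqrt{\tau T}$ over a small grid of values satisfying $\tau \ge 1$ and $T \ge \tau^2$. The grid and the function names `bound_11` and `bound_12` are arbitrary choices of ours.

```python
import math

def bound_11(F, L, tau, T, sigma):
    """Right-hand side of the regret bound (11)."""
    return (sigma * L**2 * math.sqrt(T) + F**2 * math.sqrt(T) / sigma
            + L**2 * sigma * tau**2 / 2 + 2 * L**2 * sigma * tau * math.sqrt(T))

def bound_12(F, L, tau, T):
    """Right-hand side of the simplified bound (12)."""
    return 4 * F * L * math.sqrt(tau * T)

checks = []
for F, L in [(1.0, 1.0), (0.5, 3.0)]:
    for tau in range(1, 21):
        for T in (tau**2, 10 * tau**2, 1000 * tau**2):
            sigma = F / (L * math.sqrt(2 * tau))
            checks.append(bound_11(F, L, tau, T, sigma) <= bound_12(F, L, tau, T))
all_hold = all(checks)
```

The tightest case is $\tau = 1$, $T = 1$, where the bracket equals $2.75$ and $\sqrt{2} \cdot 2.75 \approx 3.89$, just below the constant 4.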

In other words the algorithm converges at rate $O(\sqrt{\tau T})$. This is similar to what we would
expect in the worst case: an adversary may reorder instances such as to maximally slow 
down progress. In this case a parallel algorithm is no faster than a sequential code. This 
result may appear overly pessimistic in practice but the following example shows that such 
worst-case scaling behavior is to be expected: 



Lemma 3 Assume that an optimal online algorithm with regard to a convex game achieves
regret $R[m]$ after seeing $m$ instances. Then any algorithm which may only use information
that is at least $\tau$ instances old has a worst case regret bound of $\tau R[m/\tau]$.

Proof The proof is similar to the approach in Mesterharm (2005). Our construction works
by designing a sequence of functions $f_t$ where for a fixed $n \in \mathbb{N}$ all $f_{n\tau+j}$ are identical (for
$j \in \{1, \ldots, \tau\}$). That is, we send identical functions to the algorithm while it has no chance
of responding to them. Hence, even an algorithm knowing that we will see $\tau$ identical
instances in a row but being disallowed to respond to them for $\tau$ instances will do no better
than one which sees every instance once but is allowed to respond instantly. Consequently,
the regret incurred will be $\tau$ times that of an algorithm seeing $m/\tau$ instances only once each
time. ■
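
The adversarial sequence used in this construction is simple to write down: each base function is presented $\tau$ times in a row. A minimal sketch, with the hypothetical name `repeat_adversary` and string labels standing in for loss functions:

```python
def repeat_adversary(base_losses, tau):
    """Repeat every base loss tau times in a row, so that f_{n*tau+1}, ..., f_{n*tau+tau}
    are identical for each block index n."""
    return [f for f in base_losses for _ in range(tau)]

sequence = repeat_adversary(["f_a", "f_b"], tau=3)
print(sequence)
```

A learner whose information is at least $\tau$ steps old cannot react inside a block, which is what makes the regret scale with $\tau$.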

The useful consequence of Theorem 2 is that we are guaranteed to converge at all even if
we encounter delay (the latter is not trivial: after all, we could end up with an oscillating
parameter vector for overly aggressive learning rates). While such extreme cases hardly
occur in practice, we need to make stronger assumptions in terms of correlation of $f_t$ and
the degree of smoothness in $f_t$ to obtain tighter bounds.

We conclude this section by studying a particularly convenient case: the setting when
the functions $f_i$ are strongly convex with parameter $\lambda > 0$ satisfying

$$f_i(x^*) \ge f_i(x) + \langle x^* - x, \partial_x f_i(x)\rangle + \frac{\lambda}{2}\|x - x^*\|^2 \tag{20}$$

Here we can get rid of the $D(x^*\|x_1)$ dependency in the loss bound.



Theorem 4 Suppose that the functions $f_i$ are strongly convex with parameter $\lambda > 0$.
Moreover, choose the learning rate $\eta_t = \frac{1}{\lambda(t-\tau)}$ for $t > \tau$ and $\eta_t = 0$ for $t \le \tau$. Then under the
assumptions of Theorem 2 we have the following bound:

$$R[X] \le \lambda\tau F^2 + \left[\frac{1}{2} + \tau\right]\frac{L^2}{\lambda}(1 + \tau + \log T) \tag{21}$$

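Proof The proof largely follows that of Bartlett et al. (2008). The key difference is that
now we need to take the additional contribution of the gradient correlations into account.
Using (20) we have

$$\begin{aligned}
R[X] &\le \sum_{t=\tau+1}^{T+\tau} \langle x_{t-\tau} - x^*, g_{t-\tau}\rangle - \frac{\lambda}{2}\|x_{t-\tau} - x^*\|^2 \\
&\le \sum_{t=\tau+1}^{T+\tau} \left[\frac{1}{2}\eta_t\|g_{t-\tau}\|^2 + \sum_{j=1}^{\min(\tau,t-(\tau+1))} \eta_{t-j}\langle g_{t-\tau-j}, g_{t-\tau}\rangle + \frac{D(x^*\|x_t) - D(x^*\|x_{t+1})}{\eta_t} - \frac{\lambda}{2}\|x_{t-\tau} - x^*\|^2\right] \\
&\le \sum_{t=\tau+1}^{T+\tau} \left[\frac{1}{2}\eta_t + \tau\eta_{\max(t-\tau,\tau+1)}\right]L^2 + \lambda(t-\tau)\left[D(x^*\|x_t) - D(x^*\|x_{t+1})\right] - \frac{\lambda}{2}\|x_{t-\tau} - x^*\|^2 \\
&= \sum_{t=\tau+1}^{T+\tau} \left[\frac{1}{2}\eta_t + \tau\eta_{\max(t-\tau,\tau+1)}\right]L^2 + \lambda(t-\tau)\left[D(x^*\|x_t) - D(x^*\|x_{t+1})\right] - \lambda D(x^*\|x_{t-\tau}) \\
&= \sum_{t=\tau+1}^{T+\tau} \left[\frac{1}{2}\eta_t + \tau\eta_{\max(t-\tau,\tau+1)}\right]L^2 + \lambda(t-(\tau+1))D(x^*\|x_t) - \lambda(t-\tau)D(x^*\|x_{t+1}) \\
&\quad + \lambda\left(D(x^*\|x_t) - D(x^*\|x_{t-\tau})\right)
\end{aligned}$$

Via telescoping:

$$\begin{aligned}
R[X] &\le \lambda((\tau+1) - (\tau+1))D(x^*\|x_{\tau+1}) - \lambda T D(x^*\|x_{T+\tau+1}) \\
&\quad + \sum_{t=1}^{\tau} \lambda\left(D(x^*\|x_{T+t}) - D(x^*\|x_t)\right) + \sum_{t=\tau+1}^{T+\tau} \left[\frac{1}{2}\eta_t + \tau\eta_{\max(t-\tau,\tau+1)}\right]L^2 \\
&\le \lambda\tau F^2 + \sum_{t=\tau+1}^{T+\tau} \left[\frac{1}{2}\eta_t + \tau\eta_{\max(t-\tau,\tau+1)}\right]L^2
\end{aligned}$$

By construction, $\eta_t$ (when $t \ge \tau + 1$) is monotonically decreasing, hence we have
$\eta_t \le \eta_{\max(t-\tau,\tau+1)}$, so we can bound:

$$\begin{aligned}
R[X] &\le \lambda\tau F^2 + \left[\frac{1}{2} + \tau\right]L^2 \sum_{t=\tau+1}^{T+\tau} \eta_{\max(t-\tau,\tau+1)} \\
&\le \lambda\tau F^2 + \left[\frac{1}{2} + \tau\right]L^2 \sum_{i=1}^{T} \eta_{\max(i,\tau+1)} \\
&\le \lambda\tau F^2 + \left[\frac{1}{2} + \tau\right]L^2 \left(\frac{\tau}{\lambda} + \sum_{i=1}^{T-\tau} \frac{1}{\lambda i}\right) \\
&\le \lambda\tau F^2 + \left[\frac{1}{2} + \tau\right]\frac{L^2}{\lambda}\left(\tau + 1 + \log(T-\tau)\right)
\end{aligned}$$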

As before, we pay a linear price in the delay $\tau$.

4. Decorrelating Gradients

To improve our bounds beyond the most pessimistic case we need to assume that the
adversary is not acting in the most hostile fashion possible. In the following we study the
opposite case, namely that the adversary is drawing the functions $f_i$ iid from an arbitrary
(but fixed) distribution. The key reason for this requirement is that we need to control the
value of $\langle g_t, g_{t'}\rangle$ for adjacent gradients.

The flavor of the bounds we use will be in terms of the expected regret rather than
an actual regret. Conversions from expected to realized regret are standard. See e.g.
(Nesterov and Vial, 2000, Lemma 2) for an example of this technique. For this purpose we
need to take expectations of sums of copies of (5) in Lemma 1. Note that this is feasible
since expectations are linear and whenever products between more than one term occur,
they can be seen as products which are conditionally independent given past parameters,
such as $\langle g_t, g_{t'}\rangle$ for $|t - t'| \le \tau$ (in this case no information about $g_t$ can be used to infer
$g_{t'}$ or vice versa, given that we already know all the history up to time $\min(t, t') - 1$). Our
informal argument can be formalized by using martingale techniques. We omit the latter
in favor of a much more streamlined discussion. Since the argument is rather repetitive
(we will prove a number of different bounds) we will not discuss issues with conditional
expectations any further.

Key quantities in our analysis are bounds on the correlation between subsequent
instances. In some cases we will only be able to obtain bounds on the expected regret rather
than the actual regret. For the reasons pointed out in Lemma 3 this is an in-principle
limitation of the setting.

Our first strategy is to assume that $f_t$ arises from a scalar function of a linear function
class. This leads to bounds which, while still bearing a linear penalty in $\tau$, make do with
considerably improved constants. The second strategy makes stringent smoothness
assumptions on $f_t$, namely it assumes that the gradients themselves are Lipschitz continuous. This
will lead to guarantees for which the delay becomes increasingly irrelevant as the algorithm
progresses.

4.1 Covariance bounds for linear function classes 

Many functions $f_t(x)$ depend on $x$ only via an inner product. They can be expressed as

$$f_t(x) = l(y_t, \langle z_t, x\rangle) \quad\text{and hence}\quad g_t(x) = \nabla f_t(x) = z_t\,\partial_{\langle z_t,x\rangle} l(y_t, \langle z_t, x\rangle) \tag{22}$$

Now assume that $|\partial_{\langle z_t,x\rangle} l(y_t, \langle z_t,x\rangle)| \le \Lambda$ for all $x$ and all $t$. This holds, e.g. in the case of
logistic regression, the soft-margin hinge loss, novelty detection. In all three cases we have
$\Lambda = 1$. Robust loss functions such as Huber's regression score (Huber, 1981) also satisfy
(22), although with a different constant (the latter depends on the level of robustness). For
such problems it is possible to bound the correlation between subsequent gradients via the
following lemma:

Lemma 5 Denote by $(y,z),(y',z') \sim \Pr(y,z)$ random variables which are drawn
independently of $x, x' \in X$. In this case

$$\mathbf{E}_{y,z,y',z'}\left[\langle \partial_x l(y,\langle z,x\rangle), \partial_x l(y',\langle z',x'\rangle)\rangle\right] \le \Lambda^2 \left\|\mathbf{E}_{z,z'}\left[z' z^\top\right]\right\|_{\mathrm{Frob}} =: L^2\alpha \tag{23}$$

Here we defined $\alpha$ to be the scaling factor which quantifies by how much gradients are
correlated.

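Proof By construction we may bound the inner product for linear function classes using
the Lipschitz constant $\Lambda$. This yields the upper bound

$$\Lambda^2\,\mathbf{E}_{z,z'}\left[|\langle z, z'\rangle|\right] \le \Lambda^2\left[\mathbf{E}_{z,z'}\left[\langle z,z'\rangle^2\right]\right]^{\frac{1}{2}} = \Lambda^2\left\|\mathbf{E}_{z,z'}\left[z' z^\top\right]\right\|_{\mathrm{Frob}}$$

Here the first term follows from Lipschitz continuity and the inequality is a consequence of
the quadratic function being convex. ■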
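
To get a feel for the Frobenius norm of the second moment matrix that appears in this bound, it helps to compare two extreme toy datasets of unit-norm instances. The sketch below is an illustration under our own assumptions, not data from the paper: "sparse" instances are the standard basis vectors (each feature occurs in exactly one instance), "dense" instances are all equal to the normalized all-ones vector. The helper name `second_moment_frobenius` is hypothetical.

```python
import numpy as np

def second_moment_frobenius(Z):
    """Frobenius norm of the empirical second moment matrix (1/m) * sum_i z_i z_i^T,
    where the rows of Z are the instances z_i."""
    M = Z.T @ Z / Z.shape[0]
    return np.linalg.norm(M, "fro")

d = 16
sparse = np.eye(d)                       # d unit-norm instances with disjoint support
dense = np.ones((d, d)) / np.sqrt(d)     # d identical unit-norm instances
sparse_norm = second_moment_frobenius(sparse)
dense_norm = second_moment_frobenius(dense)
```

For the disjoint-support data the second moment is $I/d$ with Frobenius norm $1/\sqrt{d}$, whereas for identical instances it is $1$: fully correlated instances give the largest value, which is the regime where delay hurts most.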

We can apply this decorrelation inequality to the previous two learning algorithms.
Theorem 4 allows a direct tightening of the guarantees. While the order of the algorithm has not
improved relative to the worst case setting, we have considerably tighter bounds
nonetheless: for instance, for sparse data such as texts the correlation terms are rather small, hence
the Frobenius norm of the second moment is small as the second moment matrix is
diagonally dominant. Generally, $\|\mathbf{E}[zz^\top]\|_{\mathrm{Frob}} \le L^2$ since the gradient is maximized by having
maximal value of the gradient of $l(y, \langle z, x\rangle)$ and an instance of $z$ with large norm. Likewise,
we may obtain a tighter version of Theorem 2:



Corollary 6 Given $\eta_t = \frac{\sigma}{\sqrt{t-\tau}}$ and the conditions of Lemma 5 the regret of the delayed
update algorithm is bounded by

$$R[X] \le \sigma L^2\sqrt{T} + F^2\frac{\sqrt{T}}{\sigma} + L^2\alpha\frac{\sigma\tau^2}{2} + 2L^2\alpha\sigma\tau\sqrt{T} \tag{24}$$

and consequently for $\sigma^2 = \frac{F^2}{2\tau\alpha L^2}$ (assuming that $\tau\alpha \ge 1$) and $T \ge \tau^2$ we obtain the bound

$$R[X] \le 4FL\sqrt{\alpha\tau T} \tag{25}$$

Proof [sketch only] The proof is identical to that of Theorem 2 except that the terms
linear and quadratic in $\tau$ are rescaled by a factor of $\alpha$. Substituting the new value for $\sigma$
and exploiting $\alpha\tau \ge 1$ proves the claim. ■



4.2 Bounds for smooth gradients 

The key to improving the rate rather than the constant with regard to which the bounds
depend on $\tau$ is to impose further smoothness constraints on $f_t$. The rationale is quite simple:
we want to ensure that small changes in $x$ do not lead to large changes in the gradient.
This is precisely what we need in order to show that a small delay (which amounts to small
changes in $x$) will not impact the update that is carried out to a significant amount. More
specifically we assume that the gradient of $f$ is a Lipschitz-continuous function. That is,

$$\|\nabla f_t(x) - \nabla f_t(x')\| \le H\|x - x'\|. \tag{26}$$

Such a constraint effectively rules out piecewise linear loss functions, such as the hinge loss,
structured estimation, or the novelty detection loss. Nonetheless, since this discontinuity
only occurs on a set of measure 0, delayed stochastic gradient descent still works very well
on them in practice. We need an auxiliary lemma which allows us to control the magnitude
of the gradient as a function of the distance from optimality:

Lemma 7 Assume that $f$ is convex and moreover that $\partial_x f(x)$ is Lipschitz continuous with
constant $H$. Finally, denote by $x^*$ the minimizer of $f$. In this case

$$\|\partial_x f(x)\|^2 \le 2H[f(x) - f(x^*)]. \tag{27}$$

Proof The proof decomposes into two parts: we first show that the problem can be 
reduced to a one-dimensional setting and secondly we show that the claim holds in the 
one-dimensional case. 

Part 1: For a given function $f$ with minimizer $x^*$ and for an arbitrary starting point $x$
we can simply follow the opposite of the gradient field $-\partial_x f(x)$ starting at $x$ to arrive at $x^*$.³
The parametrized curve corresponding to the gradient flow is still monotonically decreasing,
its directed gradient equals $-\|\partial_x f(x)\|$ along the curve, and moreover, distances between
points on the curve are bounded from above by the length of the path between them. Hence,
(26) holds for the now one-dimensional restriction of $f$. Note that the derivative along the
path is strictly negative until the end, which is what one would expect as one heads to a minimum.

Part 2: Now assume that $f$ is defined on $\mathbb{R}$. Without loss of generality we set $x = 0$
and let $x^* > 0$. Since the gradient cannot vanish any faster than the constraint of (27) allows,
it follows that for all $t \in [0, x^*]$ the gradient is bounded from above by

$$f'(t) \le \min(0, f'(0) + Ht). \tag{28}$$

Note that by construction, the gradient is strictly negative from $0$ (inclusive) to $x^*$
(exclusive), hence the upper bound of zero. Define $t^* = -f'(0)/H$, such that $f'(0) + Ht^* = 0$.
Clearly $t^* \in [0, x^*]$, since $t^* > x^*$ would imply that $f'(x^*) < 0$ and hence that $x^*$ is
not the minimizer. Integrating the upper bound on $f'(t)$ yields



$$f(x^*) - f(0) = \int_0^{x^*} f'(t)\,dt \le \int_0^{x^*} \min(0, f'(0) + Ht)\,dt = \int_0^{t^*} [Ht + f'(0)]\,dt = -\frac{[f'(0)]^2}{2H}.$$



Multiplying both sides by $-2H$ (and switching the inequality) proves the claim. Note that
we did not require in the second part of the proof that $f$ is convex or monotonic. This
information was only used in part 1 to generate the gradient flow. ■
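
As a quick numerical sanity check of Lemma 7, the following sketch evaluates both sides of (27) on two one-dimensional functions. The test functions (a quadratic and log cosh), the sampling range, and the helper name `satisfies_lemma7` are illustrative assumptions, not part of the original analysis.

```python
import math
import random


def satisfies_lemma7(f, grad, H, f_min, points, rel_tol=1e-9):
    """Check grad(x)^2 <= 2*H*(f(x) - f_min) at every sample point.

    A small tolerance absorbs floating-point rounding in the case
    where the bound holds with equality.
    """
    for x in points:
        rhs = 2.0 * H * (f(x) - f_min)
        if grad(x) ** 2 > rhs * (1.0 + rel_tol) + 1e-12:
            return False
    return True


random.seed(0)
points = [random.uniform(-20.0, 20.0) for _ in range(1000)]

# Quadratic f(x) = (H/2) x^2: the gradient H*x is H-Lipschitz and the
# bound of Lemma 7 is attained with equality.
H_quad = 3.0
quad_ok = satisfies_lemma7(lambda x: 0.5 * H_quad * x * x,
                           lambda x: H_quad * x, H_quad, 0.0, points)

# f(x) = log cosh(x): f''(x) = 1 - tanh(x)^2 <= 1, so H = 1 is valid.
logcosh_ok = satisfies_lemma7(lambda x: math.log(math.cosh(x)),
                              math.tanh, 1.0, 0.0, points)

# Claiming too small a constant (H = 1 for the quadratic) must fail.
too_small_fails = not satisfies_lemma7(lambda x: 0.5 * H_quad * x * x,
                                       lambda x: H_quad * x, 1.0, 0.0, points)

print(quad_ok, logcosh_ok, too_small_fails)
```

The quadratic case shows that the constant $2H$ in (27) cannot be improved in general.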

This inequality will become useful to show that as we are approaching optimality, the
expected gradient $\partial_x f^*(x)$ also needs to vanish. Since $g_t$ is assumed to change smoothly
with $x$ this implies that in expectation $g_t$ will vanish for $x \to x^*$ at a controlled rate. We
now state our main result:

Theorem 8 In addition to the conditions of Theorem 2 assume that the functions $f_i$ are
i.i.d., that $H \ge \frac{L}{4F\sqrt{\tau}}$, and that $H$ also upper-bounds the change in the gradients
as in Lemma 7. Moreover, assume that we choose a learning rate $\eta_t = \frac{\sigma}{\sqrt{t-\tau}}$
with $\sigma = \frac{F}{L}$. In this case the risk is bounded by

$$\mathbf{E}[R[X]] \le \left[28.3\,F^2 H + \tfrac{2}{3} F L + \tfrac{4}{3} F^2 H \log T\right] \tau^2 + \tfrac{8}{3} F L \sqrt{T}. \tag{29}$$

3. This is simply gradient descent and related to the Picard-Lindelöf theorem. Without loss of generality
we define $x^*$ to be the particular minimizer of $f$ that is reached by going in the opposite direction
of the gradient, whenever there is ambiguity in the choice.

Proof Our proof is quite similar to that of Theorem 2. The key difference is that we may
now bound the expected change between subsequent gradients in terms of the optimality
gap itself. In particular:

$$\mathbf{E}\Big[\sum_{t=1}^{T} f^*(x_t)\Big] = \mathbf{E}\Big[\sum_{t=1}^{T} f_t(x_t)\Big]. \tag{30}$$

Moreover, observe that:

$$f^*(x^*) = \min_x \mathbf{E}[f_1(x)] \ge \mathbf{E}[\min_x f_1(x)]. \tag{31}$$

Moving the minimum inside the expectation means that we may choose the point after we know
which function is drawn, as opposed to before, which makes the problem easier and the expected
cost lower. Note that $\sum_{t=1}^{T} f_t$, because it is a sum of random functions with mean $f^*$,
is itself a random function with mean $T f^*$. So the same reasoning applies:

$$T f^*(x^*) \ge \mathbf{E}\Big[\min_x \sum_{t=1}^{T} f_t(x)\Big]. \tag{32}$$

Therefore:

$$\mathbf{E}[R[X]] = \mathbf{E}\Big[\max_x \sum_{t=1}^{T} [f_t(x_t) - f_t(x)]\Big] \ge \mathbf{E}\Big[\sum_{t=1}^{T} [f^*(x_t) - f^*(x^*)]\Big]. \tag{33}$$

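We can extract from the proof of Theorem 2 that:

\begin{align}
R[X] &\le \sigma L^2 \sqrt{T} + \frac{F^2 \sqrt{T}}{\sigma} + \sum_{t=\tau+1}^{T+\tau} \sum_{j=1}^{\min(\tau,\, t-(\tau+1))} \eta_{t-j} \langle g_{t-\tau-j}, g_{t-\tau} \rangle \tag{34} \\
\mathbf{E}[R[X]] &\le \sigma L^2 \sqrt{T} + \frac{F^2 \sqrt{T}}{\sigma} + \sum_{t=\tau+1}^{T+\tau} \sum_{j=1}^{\min(\tau,\, t-(\tau+1))} \eta_{t-j} \mathbf{E}[\langle g_{t-\tau-j}, g_{t-\tau} \rangle] \tag{35}
\end{align}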

Consider the gradient correlation of (18) for $t > \tau$:

$$C_t := \sum_{j=1}^{\tau} \eta_{t-j} \langle g_{t-\tau-j}, g_{t-\tau} \rangle. \tag{36}$$

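We know that $\|x_t - x_{t'}\| \le L \sum_{j=t}^{t'} \eta_j$ since each gradient is bounded by $L$. By the
smoothness constraint on the gradients this implies that
$\|\partial_x (f_i(x_t) - f_i(x_{t'}))\| \le L H \sum_{j=t}^{t'} \eta_j$. This means that as $\eta_t \to 0$
the error induced by the delayed update becomes a second order effect as the algorithm converges.
In summary, we may bound $C_t$ as follows:

$$\begin{aligned}
C_t &= \sum_{j=1}^{\tau} \eta_{t-j} \langle \nabla f_{t-\tau-j}(x_{t-\tau-j}), \nabla f_{t-\tau}(x_{t-\tau}) \rangle \\
&= \sum_{j=1}^{\tau} \eta_{t-j} \langle \nabla f_{t-\tau-j}(x_{t-\tau}), \nabla f_{t-\tau}(x_{t-\tau}) \rangle \\
&\quad + \sum_{j=1}^{\tau} \eta_{t-j} \langle \nabla f_{t-\tau-j}(x_{t-\tau-j}) - \nabla f_{t-\tau-j}(x_{t-\tau}), \nabla f_{t-\tau}(x_{t-\tau}) \rangle \\
&\le \sum_{j=1}^{\tau} \eta_{t-j} \left[ \langle \nabla f_{t-\tau-j}(x_{t-\tau}), \nabla f_{t-\tau}(x_{t-\tau}) \rangle + j\, \eta_{t-2\tau} L^2 H \right]
\end{aligned} \tag{37}$$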

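Taking expectations of the upper bound is feasible, since all $f_{t-\tau-j}$ and $f_{t-\tau}$ are
independent of each other and of their argument $x_{t-\tau}$. This yields the upper bound

\begin{align}
\mathbf{E}[C_t] &\le \sum_{j=1}^{\tau} \eta_{t-j} \left[ \|\nabla f^*(x_{t-\tau})\|^2 + j\, \eta_{t-2\tau} L^2 H \right] \tag{38} \\
&\le 2 \eta_{t-\tau} \tau H\, [f^*(x_{t-\tau}) - f^*(x^*)] + \eta_{t-2\tau}^2 \tau^2 H L^2 \tag{39}
\end{align}

The second inequality is obtained by appealing to Lemma 7 and by using the fact that the
learning rate is monotonically decreasing.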

What this means is that once the learning rate is small enough, second order effects become
essentially negligible. The fraction by which the bound on the expected regret
$f^*(x_t) - f^*(x^*)$ is reduced is given by $2\tau H \eta_{t-\tau}$. If we wish to limit this
reduction to $\frac{1}{4}$, this implies for a learning rate of $\eta_t = \sigma/\sqrt{t-\tau}$ that we
should use (39) only for $t \ge t_0 := 3\tau + 64\sigma^2\tau^2H^2 \le 112\sigma^2\tau^2H^2$ (the latter
bound holds by the assumption on $H$). We now bound the part of the risk where the effects of the
delay are sufficiently small. In analogy to (15) we obtain

\begin{align}
\mathbf{E}[R[X]] &\le \sigma L^2 \sqrt{T} + \frac{F^2 \sqrt{T}}{\sigma} + \sum_{t=\tau+1}^{T+\tau} \sum_{j=1}^{\min(\tau,\, t-(\tau+1))} \eta_{t-j} \mathbf{E}[\langle g_{t-\tau-j}, g_{t-\tau} \rangle] \\
\mathbf{E}[R[X]] &\le \sigma L^2 \sqrt{T} + \frac{F^2 \sqrt{T}}{\sigma} + \sum_{t=\tau+1}^{t_0} \sum_{j=1}^{\min(\tau,\, t-(\tau+1))} \eta_{t-j} \mathbf{E}[\langle g_{t-\tau-j}, g_{t-\tau} \rangle] + \sum_{t=t_0+1}^{T+\tau} \mathbf{E}[C_t] \\
\mathbf{E}[R[X]] &\le \sigma L^2 \sqrt{T} + \frac{F^2 \sqrt{T}}{\sigma} + \frac{L^2 \sigma \tau^2}{2} + 2 L^2 \sigma \tau \sqrt{t_0} + \sum_{t=t_0+1}^{T+\tau} \mathbf{E}[C_t]
\end{align}

All but the last term can be bounded in the same way as in Theorem 2. The sum over
the gradient norms can be bounded from above by $\sigma L^2 \sqrt{T}$ as in (16). Likewise, the sum
over the divergences can be bounded by $\frac{F^2}{\sigma}\sqrt{T}$ as in (17). Lastly, since
$2H\tau\eta_{t-\tau} \le \frac{1}{4}$ when $t > t_0$ and $f^*(x_{t-\tau}) - f^*(x^*) \ge 0$, the sum over
the gradient correlations is bounded as follows:

\begin{align}
\mathbf{E}\Big[\sum_{t=t_0+1}^{T+\tau} C_t\Big] &\le \sum_{t=t_0+1}^{T+\tau} \tfrac{1}{4} \mathbf{E}[f^*(x_{t-\tau}) - f^*(x^*)] + \sum_{t=t_0+1}^{T+\tau} L^2 H \eta_{t-2\tau}^2 \tau^2 \tag{40} \\
&\le \sum_{t=1}^{T} \tfrac{1}{4} \mathbf{E}[f^*(x_t) - f^*(x^*)] + L^2 \tau^2 \sigma^2 H \log T \tag{41} \\
&\le \tfrac{1}{4} \mathbf{E}[R[X]] + L^2 \tau^2 \sigma^2 H \log T \tag{42}
\end{align}


For bounding the sum over $\eta^2$ we used a conversion of the sum to an integral and the fact
that $t_0 \ge \tau + 1$. The first term equals the expected regret over the time period $[t_0, T + \tau]$.
Hence, multiplying the overall regret bound by $1/(1 - 1/4) = 4/3$ and combining the sum
over $[t_0, T]$ with Theorem 2 (which covers the segment $[\tau, t_0]$) yields the following guarantee:

$$\tfrac{3}{4} \mathbf{E}[R[X]] \le \sigma L^2 \sqrt{T} + \frac{F^2}{\sigma} \sqrt{T} + \tfrac{1}{2} L^2 \sigma \tau^2 + 2 L^2 \sigma \tau \sqrt{t_0} + L^2 \tau^2 \sigma^2 H \log T \tag{43}$$



Plugging in $\sigma = F/L$ and $t_0 \le 112 F^2 \tau^2 H^2 / L^2$, using the fact that $\tau \ge 1$, and
collecting terms yields

$$\tfrac{3}{4} \mathbf{E}[R[X]] \le \left[21.2\,F^2 H + \tfrac{1}{2} F L + F^2 H \log T\right] \tau^2 + 2 F L \sqrt{T}. \tag{44}$$



Dividing by $\frac{3}{4}$ proves the claim.
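
The arithmetic behind the constants in (29) can be checked numerically. The sketch below evaluates the right-hand side of (43) with $\sigma = F/L$ and $t_0 = 3\tau + 64\sigma^2\tau^2H^2$, divides by $3/4$, and compares it with (29). The function names and the example values of $F$, $L$, $\tau$ and $T$ are illustrative assumptions.

```python
import math


def t0_threshold(sigma, tau, H):
    """t_0 = 3*tau + 64*sigma^2*tau^2*H^2 from the proof of Theorem 8."""
    return 3 * tau + 64 * sigma**2 * tau**2 * H**2


def bound_via_43(T, tau, F, L, H):
    """Right-hand side of (43) with sigma = F/L, divided by 3/4."""
    sigma = F / L
    t0 = t0_threshold(sigma, tau, H)
    rhs = (sigma * L**2 * math.sqrt(T)
           + F**2 / sigma * math.sqrt(T)
           + 0.5 * L**2 * sigma * tau**2
           + 2 * L**2 * sigma * tau * math.sqrt(t0)
           + L**2 * tau**2 * sigma**2 * H * math.log(T))
    return rhs / 0.75


def bound_29(T, tau, F, L, H):
    """Right-hand side of (29)."""
    return ((28.3 * F**2 * H + (2 / 3) * F * L
             + (4 / 3) * F**2 * H * math.log(T)) * tau**2
            + (8 / 3) * F * L * math.sqrt(T))


def delay_factor_at_t0(tau, F, L, H):
    """2*tau*H*eta_{t0 - tau} with eta_t = sigma / sqrt(t - tau)."""
    sigma = F / L
    t0 = t0_threshold(sigma, tau, H)
    return 2 * tau * H * sigma / math.sqrt(t0 - 2 * tau)


F, L, tau, T = 1.0, 2.0, 10, 10**6
H_min = L / (4 * F * math.sqrt(tau))   # smallest H allowed by Theorem 8
for H in (H_min, 1.0):
    print(H, bound_via_43(T, tau, F, L, H), bound_29(T, tau, F, L, H),
          delay_factor_at_t0(tau, F, L, H))
```

The leading constant arises as $2\sqrt{112} \approx 21.2$, and $21.2 \cdot 4/3 \approx 28.3$.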



Note that the convergence bound, which is $O(\tau^2 \log T + \sqrt{T})$, is governed by two different
regimes. Initially, a delay of $\tau$ can be quite harmful since subsequent gradients are highly
correlated. At a later stage, when optimization becomes increasingly an averaging process, a
delay of $\tau$ in the updates proves to be essentially harmless. The key difference to the bounds of
Theorem 2 is that now the rate of convergence has improved dramatically and is essentially
as good as in sequential online learning. Note that $H$ does not influence the asymptotic
convergence properties but it significantly affects the initial convergence properties.

This is exactly what one would expect: initially, while we are far away from the solution
$x^*$, parallelism does not help much in providing us with guidance to move towards $x^*$.
However, after a number of steps online learning effectively becomes an averaging process
for variance reduction around $x^*$ since the stepsize is sufficiently small. In this case averaging
becomes the dominant force, hence parallelization does not degrade convergence further.
Such a setting is desirable: after all, we want to have good convergence for extremely
large amounts of data.
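
The two regimes can be observed in a small simulation. The following is a minimal sketch of delayed stochastic gradient descent with learning rate $\eta_t = \sigma/\sqrt{t-\tau}$ on a hypothetical noiseless least-squares problem. The problem, the value of $\sigma$, and the function name `delayed_sgd` are assumptions chosen for illustration, not the experimental setup of this paper.

```python
import math
import random
from collections import deque


def delayed_sgd(T, tau, sigma=0.1, seed=0):
    """Run SGD where the step at time t applies the gradient from time t - tau.

    Losses are f_t(x) = 0.5 * (<x, z_t> - <w_star, z_t>)^2 with z_t drawn
    uniformly from [-1, 1]^2 (a hypothetical smooth test problem).
    Returns the Euclidean distance of the final iterate from w_star.
    """
    rng = random.Random(seed)
    w_star = [1.0, -2.0]
    x = [0.0, 0.0]
    pending = deque()  # gradients that have been computed but not yet applied
    for t in range(1, T + 1):
        z = [rng.uniform(-1.0, 1.0) for _ in w_star]
        residual = sum((xi - wi) * zi for xi, wi, zi in zip(x, w_star, z))
        pending.append([residual * zi for zi in z])  # g_t, evaluated at x_t
        if len(pending) > tau:
            g = pending.popleft()                    # g_{t - tau}
            eta = sigma / math.sqrt(t - tau)
            x = [xi - eta * gi for xi, gi in zip(x, g)]
    return math.sqrt(sum((xi - wi) * (xi - wi) for xi, wi in zip(x, w_star)))


errors = {tau: delayed_sgd(20000, tau) for tau in (0, 5, 20)}
for tau, err in errors.items():
    print(f"tau={tau:3d}  distance to optimum = {err:.2e}")
```

With a small initial stepsize the delayed runs should end up close to the undelayed one; larger values of `sigma` or `tau` make the initial phase visibly less stable.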



4.3 Bounds for smooth gradients with strong convexity 

We conclude this section with the tightest of all bounds: the setting where the losses are
all strongly convex and smooth. This occurs, for instance, for logistic regression with $\ell_2$
regularization. Such a requirement implies that the objective function $f^*(x)$ is sandwiched
between two quadratic functions, hence it is not too surprising that we should be able to
obtain rates comparable with what is possible in the minimization of quadratic functions.
Also note that the ratio between upper and lower quadratic bound loosely corresponds to
the condition number of a quadratic function, that is, the ratio between the largest and smallest
eigenvalue of the matrix involved in the optimization problem.

The analysis combines the proof techniques of Theorem 8 with those of the earlier theorem
for strongly convex losses.



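Theorem 9 Under the assumptions of Theorem 7, in particular, assuming that all functions
$f_i$ are i.i.d. and strongly convex with constant $\lambda$ and corresponding learning rate
$\eta_t = \frac{1}{\lambda(t-\tau)}$, and provided that the loss satisfies (26) for some constant $H$,
we have the following bound on the expected regret:

$$\mathbf{E}[R[X]] \le \frac{10}{9} \left[ \lambda \tau F^2 + \left[\tfrac{1}{2} + \tau\right] \frac{L^2}{\lambda} \left[1 + \tau + \log(3\tau + (H\tau/\lambda))\right] + \frac{L^2}{2\lambda} [1 + \log T] + \frac{\pi^2 \tau^2 H L^2}{6 \lambda^2} \right]. \tag{45}$$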



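Proof As before, we bound the expected correlation between gradients via

$$\mathbf{E}[C_t] \le \frac{2 \tau H}{\lambda (t - 2\tau)} [f^*(x_{t-\tau}) - f^*(x^*)] + \frac{\tau^2 H L^2}{\lambda^2 (t - 3\tau)^2},$$

hence (if $t_0 > 3\tau$)

$$\sum_{t=t_0+1}^{T+\tau} \mathbf{E}[C_t] \le \frac{\pi^2 \tau^2 H L^2}{6 \lambda^2} + \frac{2 \tau H}{\lambda (t_0 - 2\tau + 1)} \sum_{t=t_0+1-\tau}^{T} \mathbf{E}[f^*(x_t) - f^*(x^*)].$$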



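Here the second inequality follows from the fact that the learning rate is decreasing and
from the fact that $\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}$. This allows us to combine a bound
governing the behavior until $t_0$ with a tightened-up bound once gradient changes are small. We obtain

$$\left[1 - \frac{2 \tau H}{\lambda (t_0 - 2\tau + 1)}\right] \mathbf{E}[R[X]] \le \lambda \tau F^2 + \left[\tfrac{1}{2} + \tau\right] \frac{L^2}{\lambda} [1 + \tau + \log t_0] + \frac{L^2}{2\lambda} [1 + \log T] + \frac{\pi^2 \tau^2 H L^2}{6 \lambda^2}.$$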



By choosing $t_0 = 3\tau + (H\tau/\lambda)$ we see that the factor on the LHS is bounded from below
by 0.9. This also simplifies the expression on the RHS and yields the inequality

$$0.9\, \mathbf{E}[R[X]] \le \lambda \tau F^2 + \left[\tfrac{1}{2} + \tau\right] \frac{L^2}{\lambda} \left[1 + \tau + \log(3\tau + (H\tau/\lambda))\right] + \frac{L^2}{2\lambda} [1 + \log T] + \frac{\pi^2 \tau^2 H L^2}{6 \lambda^2}. \tag{46}$$

Dividing by 0.9 proves the claim. ■

As before, this improves the rate of the bound. Instead of a dependency of the form
$O(\tau \log T)$ we now have the dependency $O(\tau^2 + \log T)$. This is particularly desirable for
large $T$. We are now within a small factor of what a fully sequential algorithm can achieve.
In fact, we could make the constant arbitrarily small for large enough $T$.

5. Bregman Divergence Analysis

We now generalize Algorithm 1 to Bregman divergences. In particular, we use the proof
technique of (Shalev-Shwartz and Singer, 2007, Section 3.1). We begin by introducing Bregman
divergences and strong convexity. Denote by $\phi : \mathcal{B} \to \mathbb{R}$ a convex function. Then the
$\phi$-divergence between $x, x' \in \mathcal{B}$ is defined as

$$D_\phi(x \| x') = \phi(x) - \phi(x') - \langle x - x', \nabla \phi(x') \rangle. \tag{47}$$

Moreover, a convex function $f$ is strongly $\sigma$-convex with respect to $\phi$ whenever the following
inequality holds for all $x, x' \in \mathcal{B}$:

$$f(x) - f(x') - \langle x - x', \nabla f(x') \rangle \ge \sigma D_\phi(x \| x'). \tag{48}$$
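
To make definition (47) concrete, the sketch below evaluates the divergence for two common choices of $\phi$: half the squared Euclidean norm, which yields half the squared distance, and the unnormalized negative entropy, which yields the generalized Kullback-Leibler divergence. The second choice and all function names are assumptions made for illustration.

```python
import math


def bregman(phi, grad_phi, x, y):
    """D_phi(x || y) = phi(x) - phi(y) - <x - y, grad phi(y)>, as in (47)."""
    g = grad_phi(y)
    inner = sum((xi - yi) * gi for xi, yi, gi in zip(x, y, g))
    return phi(x) - phi(y) - inner


def half_sq_norm(x):
    return 0.5 * sum(v * v for v in x)


def grad_half_sq_norm(x):
    return list(x)


def neg_entropy(x):
    """Unnormalized negative entropy; requires strictly positive entries."""
    return sum(v * math.log(v) - v for v in x)


def grad_neg_entropy(x):
    return [math.log(v) for v in x]


x = [0.2, 0.5, 0.3]
y = [0.4, 0.4, 0.2]

d_sq = bregman(half_sq_norm, grad_half_sq_norm, x, y)   # 0.5 * ||x - y||^2
d_kl = bregman(neg_entropy, grad_neg_entropy, x, y)     # generalized KL
print(d_sq, d_kl)
```

In both cases the divergence is nonnegative and vanishes when the two arguments coincide, as convexity of $\phi$ requires.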

Finally, for a convex function $f$ denote by $f^*$ the Fenchel-Legendre dual of $f$. It is given
by $f^*(y) = \sup_x \langle x, y \rangle - f(x)$. We are now able to define the implicit update version of
Algorithm 1.

Algorithm 2 Delayed Stochastic Gradient Descent with Implicit Updates
  Input: scalar $\sigma > 0$, delay $\tau \in \mathbb{N}$ and convex function $\phi$.
  Set $x_1, \ldots, x_\tau = 0$ and compute corresponding $g_t = \nabla f_t(x_t)$.
  for $t = \tau + 1$ to $T + \tau$ do
    Obtain $f_t$ and incur loss $f_t(x_t)$
    Compute $g_t := \nabla f_t(x_t)$ and set $\eta_t = \frac{\sigma}{\sqrt{t - \tau}}$
    Update $x_{t+1} = \nabla \phi^*(\nabla \phi(x_t) - \eta_t g_{t-\tau})$
  end for

It is easy to check that Algorithm 1 is a special case of Algorithm 2. For
$\phi(x) = \frac{1}{2} \|x\|^2$ we have that $\phi^* = \phi$ and $\nabla \phi(x) = x$. If $\phi$ is the
unnormalized logarithm we obtain delayed exponential gradient descent. We state the following
lemma without proof, since it is virtually identical to that of Shalev-Shwartz and Singer (2007):



Lemma 10 Assume that $\phi$ is 1-strongly convex with respect to the norm associated with
$\mathcal{B}$. Then for any $x^* \in \mathcal{B}$, and in particular the loss minimizer, the following holds

$$\langle x_t - x^*, g_{t-\tau} \rangle \le \frac{D_\phi(x^* \| x_t) - D_\phi(x^* \| x_{t+1})}{\eta_t} + \frac{\eta_t}{2} \|g_{t-\tau}\|^2. \tag{49}$$

Theorem 11 Assume that the implicit updates associated with $\phi$ are Lipschitz, that is

$$\|\nabla \phi^*(\nabla \phi(x) - x') - x\| \le \Phi \|x'\| \tag{50}$$

for some $\Phi > 0$. Then the delayed update algorithm has a regret bound of the form

$$R[X] \le \sigma L^2 \sqrt{T} + \frac{F^2 \sqrt{T}}{\sigma} + \frac{L^2 \Phi \sigma \tau^2}{2} + 2 L^2 \Phi \sigma \tau \sqrt{T} \tag{51}$$

and consequently for $\sigma^2 = \frac{F^2}{2 \tau \Phi L^2}$ (assuming that $\tau \Phi \ge 1$) and $T \ge \tau^2$ we
obtain the bound

$$R[X] \le 4 F L \sqrt{\Phi \tau T}. \tag{52}$$

Proof To apply the regret bounds we need to replace $\langle x_t - x^*, g_{t-\tau} \rangle$ in (49) by a term
which uses $x_{t-\tau}$ instead of $x_t$. This can be achieved by telescoping via

$$\langle x_t - x^*, g_{t-\tau} \rangle = \langle x_{t-\tau} - x^*, g_{t-\tau} \rangle + \sum_{j=0}^{\tau-1} \langle x_{t-j} - x_{t-j-1}, g_{t-\tau} \rangle. \tag{53}$$

The key difference to before is that now the difference between subsequent weight vectors
does not constitute the gradient anymore. To obtain the same type of bounds that yielded
Theorem 2 we exploit continuity in the forward and reverse transform via (50). This yields
$\langle x_t - x^*, g_{t-\tau} \rangle \ge \langle x_{t-\tau} - x^*, g_{t-\tau} \rangle - \tau \eta_{t-\tau} \Phi L^2$. Plugging
this bound into a sum over $T$ terms and using the argument as in Theorem 2 proves the claim. ■

Obtaining bounds that are as tight as Theorem 8 is the subject of further work. We anticipate,
however, that this may not be quite as easy, in particular whenever functions can change
significantly after just seeing a small number of examples, as is the case for exponentiated
gradient descent. Here a delay can be considerably more harmful than in the simple
stochastic gradient descent scenario.
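
The implicit update of Algorithm 2 can be sketched for the two choices of $\phi$ discussed in this section. The code below assumes that the exponentiated-gradient case uses the unnormalized negative entropy, for which $\nabla\phi(x) = \log x$ and $\nabla\phi^*(\theta) = e^\theta$ elementwise. The function names and the example vectors are illustrative assumptions.

```python
import math


def implicit_update(x, g, eta, grad_phi, grad_phi_conj):
    """One step x_next = grad phi^*(grad phi(x) - eta * g)."""
    theta = [ti - eta * gi for ti, gi in zip(grad_phi(x), g)]
    return grad_phi_conj(theta)


def identity(v):
    """grad phi and grad phi^* for phi(x) = 0.5 * ||x||^2."""
    return list(v)


def log_map(v):
    """grad phi for the unnormalized negative entropy (positive entries only)."""
    return [math.log(vi) for vi in v]


def exp_map(theta):
    """grad phi^* for the unnormalized negative entropy."""
    return [math.exp(ti) for ti in theta]


x = [0.5, 0.25]
g_delayed = [1.0, -2.0]   # stands in for the delayed gradient g_{t - tau}
eta = 0.1

x_gd = implicit_update(x, g_delayed, eta, identity, identity)
x_eg = implicit_update(x, g_delayed, eta, log_map, exp_map)
print(x_gd)   # plain gradient step: x - eta * g
print(x_eg)   # multiplicative step: x * exp(-eta * g)
```

The multiplicative form of the second update illustrates why a stale gradient can be more harmful there: the change in $x$ scales with the current value of $x$ itself.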



6. Experiments 



In our experiments we focused on pipelined optimization. In particular, we used two different
training sets that were based on e-mails: the TREC dataset (Cormack, 2007), consisting
of 75,419 e-mail messages, and a proprietary (significantly harder) dataset of which we took
100,000 e-mails. These e-mails were tokenized by whitespace. The problem there is one of
binary classification, that is, we are interested in minimizing

$$f_t(x) = l(y_t \langle z_t, x \rangle) \quad \text{where} \quad l(\chi) = \begin{cases} \frac{1}{2} - \chi & \text{if } \chi \le 0 \\ \frac{1}{2} (\chi - 1)^2 & \text{if } \chi \in [0, 1] \\ 0 & \text{otherwise} \end{cases} \tag{54}$$



Here $y_t \in \{\pm 1\}$ denote the labels of the binary classification problem, and $l$ is the smoothed
quadratic soft-margin loss of Langford et al. (2007). We used two feature representations:
a linear one which amounted to a simple bag of words representation, and a quadratic one
which amounted to generating a bag of word pairs (consecutive or not).

To deal with high-dimensional feature spaces we used hashing (Weinberger et al., 2009).
In particular, for the TREC dataset we used $2^{18}$ feature bins and for the proprietary dataset
we used $2^{24}$ bins. Note that hashing comes with performance guarantees which state that
the canonical distortion due to hashing is sufficiently small for the dimensionality we picked.
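
The loss (54) and a hashed bag-of-words representation can be sketched as follows. The hash function is an assumption: CRC32 is used here only because it is deterministic across runs, and it is not claimed to be the function used in the experiments. All function names and the example text are hypothetical.

```python
import zlib


def soft_margin_loss(chi):
    """Smoothed quadratic soft-margin loss l(chi) from (54)."""
    if chi <= 0.0:
        return 0.5 - chi
    if chi <= 1.0:
        return 0.5 * (chi - 1.0) ** 2
    return 0.0


def soft_margin_grad(chi):
    """Derivative l'(chi); continuous at chi = 0 and chi = 1."""
    if chi <= 0.0:
        return -1.0
    if chi <= 1.0:
        return chi - 1.0
    return 0.0


def hashed_bag_of_words(text, n_bins):
    """Map whitespace tokens to counts in n_bins hashed feature bins."""
    features = {}
    for token in text.split():
        idx = zlib.crc32(token.encode("utf-8")) % n_bins
        features[idx] = features.get(idx, 0.0) + 1.0
    return features


z = hashed_bag_of_words("buy cheap cheap pills", 2 ** 18)
print(sorted(z.items()))
print(soft_margin_loss(-1.0), soft_margin_loss(0.5), soft_margin_loss(2.0))
```

The gradient of $f_t$ with respect to $x$ is then $l'(y_t \langle z_t, x \rangle)\, y_t z_t$, which is sparse whenever $z_t$ is.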

We tried to address the following issues in our simulation: 

1. The obvious question is a systematic one: how much of a convergence penalty do we
incur in practice due to delay? This experiment checks the goodness of our bounds. We
checked convergence for a system where the delay is given by $\tau \in \{0, 10, 100, 1000\}$.

2. Secondly, we checked on an actual parallel implementation whether the algorithm
scales well. Unlike the previous check, this includes issues such as memory contention,
thread synchronization, and general feasibility of a delayed updating architecture.

[Plot panel: Performance on TREC Data]

Figure 2: (a) Experiments with simulated delay on the TREC dataset. (b) Experiments
with simulated delay on the (harder) proprietary dataset. (c) Time performance
on a subset of the TREC dataset which fits into memory, using the quadratic
representation. There was either one thread (a serial implementation) or 3 or
more threads (master and 2 or more slaves).



Implementation The code was written in Java, although several of the fundamentals,
namely hashing and the choice of loss function, were based upon VW (Langford et al., 2007).
We added regularization using lazy updates of the parameter vector (i.e. we rescale the
updates and occasionally rescale the parameter). This is akin to Leon Bottou's SGD code.
For robustness, we used $\eta_t = \frac{1}{\sqrt{t}}$.
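
The lazy regularization trick can be sketched as follows: the weight vector is stored as a scalar times a vector, so that shrinking all weights costs constant time and only the features present in an example are touched. This is a minimal illustration under assumed names (`LazyScaledVector`, the renormalization threshold `1e-6`), not the Java code used in the experiments.

```python
class LazyScaledVector:
    """Represents w = scale * v so that shrinking w costs O(1)."""

    def __init__(self, dim):
        self.v = [0.0] * dim
        self.scale = 1.0

    def shrink(self, factor):
        """w <- factor * w, with 0 < factor <= 1 (the regularization step)."""
        self.scale *= factor
        if self.scale < 1e-6:      # occasionally fold the scale back into v
            self.renormalize()

    def add_sparse(self, updates, step):
        """w <- w + step * updates, touching only the given indices."""
        for i, val in updates.items():
            self.v[i] += step * val / self.scale

    def renormalize(self):
        self.v = [self.scale * vi for vi in self.v]
        self.scale = 1.0

    def to_dense(self):
        return [self.scale * vi for vi in self.v]


# Compare against the straightforward dense computation.
lazy = LazyScaledVector(4)
dense = [0.0] * 4
for step in range(30):
    updates = {step % 4: 1.0, (step + 1) % 4: -0.5}
    lazy.shrink(0.5)
    dense = [0.5 * w for w in dense]
    lazy.add_sparse(updates, 0.1)
    for i, val in updates.items():
        dense[i] += 0.1 * val
print(lazy.to_dense())
print(dense)
```

The shrink factor of 0.5 is deliberately aggressive so that the renormalization branch is exercised; in practice the factor would be $1 - \eta_t \lambda$ for a regularization constant $\lambda$.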

All timed experiments were run on a single 8-core machine with 32 GB of memory. In
general, at least 6 of the cores were free at any given time. In order to achieve the advantages
of parallelization, we divide the feature space $\{1 \ldots n\}$ into roughly equal pieces, and assign a
slave thread to each piece. Each slave is given both the weights for its piece, as well as the
corresponding pieces of the examples. The master is given the label of each example. We
compute the dot product separately on each piece, and then send these results to the master.
The master adds the pieces together, calculates the update, and then sends that back to the
slaves. Then, the slaves update their weight vectors in proportion to the magnitude of the
central classifier. What makes this work quickly is that there are multiple examples in flight
through this dataflow simultaneously. Note that between the time when a dot product is
calculated for an example and when the results have been transcribed, the weight vector has
been updated with several other earlier examples and the dot products have been calculated
from several later examples. As a safeguard we limited the maximum delay to 100 examples.
In this case the compute slave would simply wait for the pipeline to clear.
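
The dataflow described above can be sketched in a few lines. The version below is deliberately sequential: it only illustrates the split of the feature space into pieces, the partial dot products, and the scalar update sent back by the master. It does not model threads, pipelining or delay. The class and function names, and the use of the derivative of the loss (54) for the update, are assumptions for illustration.

```python
def split_ranges(n, k):
    """Split range(n) into k contiguous, roughly equal pieces."""
    bounds = [(i * n) // k for i in range(k + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(k)]


def loss_grad(chi):
    """Derivative of the smoothed quadratic soft-margin loss."""
    if chi <= 0.0:
        return -1.0
    if chi <= 1.0:
        return chi - 1.0
    return 0.0


class Slave:
    """Owns the weights for one contiguous piece of the feature space."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.w = [0.0] * (hi - lo)

    def partial_dot(self, z_piece):
        return sum(wi * zi for wi, zi in zip(self.w, z_piece))

    def apply(self, z_piece, coef):
        self.w = [wi + coef * zi for wi, zi in zip(self.w, z_piece)]


def master_step(slaves, z, y, eta):
    """Sum partial dot products, compute the scalar update, send it back."""
    pieces = [z[s.lo:s.hi] for s in slaves]
    score = sum(s.partial_dot(p) for s, p in zip(slaves, pieces))
    coef = -eta * y * loss_grad(y * score)
    for s, p in zip(slaves, pieces):
        s.apply(p, coef)
    return score


n, k, eta = 10, 3, 0.1
slaves = [Slave(lo, hi) for lo, hi in split_ranges(n, k)]
w_full = [0.0] * n          # unsharded reference weights
examples = [([1.0 if (i + s) % 3 == 0 else 0.0 for i in range(n)],
             1 if s % 2 == 0 else -1) for s in range(6)]
for z, y in examples:
    master_step(slaves, z, y, eta)
    score = sum(wi * zi for wi, zi in zip(w_full, z))
    coef = -eta * y * loss_grad(y * score)
    w_full = [wi + coef * zi for wi, zi in zip(w_full, z)]
w_sharded = [wi for s in slaves for wi in s.w]
print(w_sharded)
```

Only one scalar per piece travels to the master and one scalar travels back, which is what makes it cheap to keep many examples in flight at once.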

The first experiment that we ran was a simulation where we artificially added a delay
between the update and the product (Figure 2a). We ran this experiment using linear
features, and observed that the performance did not noticeably degrade with a delay of 10
examples, did not significantly degrade with a delay of 100, but with a delay of 1000, the
performance became much worse.

The second experiment that we ran was with a proprietary dataset (Figure 2b). In this
case, the delays hurt less; we conjecture that this was because the information gained from
each example was smaller. In fact, even a delay of 1000 does not result in particularly bad
performance.

Encouraged by these results, we tried to parallelize these exact experiments (results
not shown). This turned out to be impossible: a serial implementation alone handled over
150,000 examples/second. However, for more complex problems, such as those with a
quadratic representation, a single example takes slightly above one millisecond.
In this domain, we found that parallelization dramatically improved performance
(Figure 2c). In this case, we loaded a small number of examples that could fit into memory⁴
and showed that the parallelization improved speed dramatically.

7. Summary and Discussion 

Trying the type of delayed updates presented here is a natural approach to the problem;
however, intuitively, having a delay of $\tau$ is like having a learning rate that is $\tau$ times larger.
In this paper, we have shown theoretically how independence between examples can make
the actual effect much smaller.

The experimental results showed three important aspects: first of all, small simulated 
delayed updates do not hurt much, and in harder problems they hurt less; secondly, in 
practice it is hard to speed up "easy" problems with a small amount of computation, such 
as e-mails with linear features; finally, when examples are larger or harder, the speedups 
can be quite dramatic. 



4. Ideally, one could design code optimized for quadratic representations, and never explicitly generate the
whole example.